Automatic Discovery of Semantic Structures in HTML Documents

Authors

Saikat Mukherjee

Guizhen Yang

Wenfang Tan

I. V. Ramakrishnan

Abstract

Template-driven HTML documents possess an implicit, fixed schema denoting concepts and their relationships in a hierarchical fashion. Discovering this schema remains a relatively unexplored problem. By exploiting a key observation that semantically related items in HTML documents exhibit spatial locality, we develop an algorithm for automatically partitioning them into tree-like semantic structures which expose the implicit schema.
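
The spatial-locality observation can be illustrated with a minimal sketch. This is not the authors' algorithm, which the abstract does not spell out. It assumes, purely for illustration, that "related" can be approximated by "adjacent text leaves that share the same root-to-leaf tag path". The names LeafPathCollector, partition_by_path and VOID_TAGS are hypothetical.

```python
from html.parser import HTMLParser
from itertools import groupby

# Tags that never have a closing tag; skipped so they do not pollute the path.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class LeafPathCollector(HTMLParser):
    """Record each text leaf together with its root-to-leaf tag path."""

    def __init__(self):
        super().__init__()
        self.stack = []   # tags that are currently open
        self.leaves = []  # (path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.leaves.append(("/".join(self.stack), text))


def partition_by_path(html):
    """Group runs of adjacent leaves that share the same tag path.

    Assumption for this sketch: an identical path plus adjacency stands in
    for the 'spatial locality' of semantically related items.
    """
    collector = LeafPathCollector()
    collector.feed(html)
    collector.close()
    return [
        (path, [text for _, text in run])
        for path, run in groupby(collector.leaves, key=lambda leaf: leaf[0])
    ]


sample = (
    "<div><h2>News</h2><ul><li>A</li><li>B</li></ul>"
    "<h2>Sports</h2><ul><li>C</li></ul></div>"
)
partitions = partition_by_path(sample)
for path, items in partitions:
    print(path, items)
```

On the sample, each heading forms its own group and the list items beneath it form the next one. Nesting such groups under their preceding heading would be one way to move toward a tree-like structure, though that step is left out here.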

Related resources

Transforming Arbitrary Tables into F-Logic Frames with TARTAR

The tremendous success of the World Wide Web is countervailed by the effort needed to search for and find relevant information. For tabular structures embedded in HTML documents, typical keyword-based or link-analysis-based search fails. The Semantic Web relies on annotating resources such as documents by means of ontologies and aims to overcome the bottleneck of finding relevant information. Turning the cur...

Automatic Annotation of Content-Rich HTML Documents: Structural and Semantic Analysis

Although RDF/XML has been widely recognized as the standard vehicle for representing semantic information on the Web, an enormous amount of semantic data is still being encoded in HTML documents that are designed primarily for human consumption and not directly amenable to machine processing. This paper seeks to bridge this semantic gap by addressing the fundamental problem of automatically ann...

Reverse Engineering for Web Data: From Visual to Semantic Structures

Despite the advancement of XML, the majority of documents on the Web are still marked up with HTML for visual rendering purposes only, thus building up a huge amount of "legacy" data. In order to facilitate querying Web-based data in a way that is more efficient and effective than keyword-based retrieval alone, enriching such Web documents with both structure and semantics is necessary. This paper describes...

A Formal Ontology Discovery from Web Documents

The huge number of documents distributed over the WWW can be regarded as easily accessible resources of domain-specific knowledge. However, users can also be overwhelmed by the sheer quantity, uneven quality, and unfamiliar content of these documents, which arise from the easy accessibility of specific domains and the unstructured nature of the WWW. One of the possible solutions to this problem ...
